vh 1
Regret Bounds for Kernel-Based Reinforcement Learning
Domingues, Omar Darwiche, Ménard, Pierre, Pirotta, Matteo, Kaufmann, Emilie, Valko, Michal
We consider the exploration-exploitation dilemma in finite-horizon reinforcement learning problems whose state-action space is endowed with a metric. We introduce Kernel-UCBVI, a model-based optimistic algorithm that leverages the smoothness of the MDP and a non-parametric kernel estimator of the rewards and transitions to efficiently balance exploration and exploitation. Unlike existing approaches with regret guarantees, it does not use any kind of partitioning of the state-action space. For problems with $K$ episodes and horizon $H$, we provide a regret bound of $O\left( H^3 K^{\max\left(\frac{1}{2}, \frac{2d}{2d+1}\right)}\right)$, where $d$ is the covering dimension of the joint state-action space. We empirically validate Kernel-UCBVI on discrete and continuous MDPs.
Zooming for Efficient Model-Free Reinforcement Learning in Metric Spaces
Touati, Ahmed, Taiga, Adrien Ali, Bellemare, Marc G.
Despite the wealth of research into provably efficient reinforcement learning algorithms, most works focus on tabular representation and thus struggle to handle exponentially or infinitely large state-action spaces. In this paper, we consider episodic reinforcement learning with a continuous state-action space which is assumed to be equipped with a natural metric that characterizes the proximity between different states and actions. We propose ZoomRL, an online algorithm that leverages ideas from continuous bandits to learn an adaptive discretization of the joint space by zooming in more promising and frequently visited regions while carefully balancing the exploitation-exploration trade-off. We show that ZoomRL achieves a worst-case regret $\tilde{O}(H^{\frac{5}{2}} K^{\frac{d+1}{d+2}})$ where $H$ is the planning horizon, $K$ is the number of episodes and $d$ is the covering dimension of the space with respect to the metric. Moreover, our algorithm enjoys improved metric-dependent guarantees that reflect the geometry of the underlying space. Finally, we show that our algorithm is robust to small misspecification errors.
Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation
This paper studies the statistical theory of batch data reinforcement learning with function approximation. Consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history generated by unknown behavioral policies. We study a regression-based fitted Q iteration method, and show that it is equivalent to a model-based method that estimates a conditional mean embedding of the transition operator. We prove that this method is information-theoretically optimal and has nearly minimal estimation error. In particular, by leveraging contraction property of Markov processes and martingale concentration, we establish a finite-sample instance-dependent error upper bound and a nearly-matching minimax lower bound. The policy evaluation error depends sharply on a restricted $\chi^2$-divergence over the function class between the long-term distribution of the target policy and the distribution of past data. This restricted $\chi^2$-divergence is both instance-dependent and function-class-dependent. It characterizes the statistical limit of off-policy evaluation. Further, we provide an easily computable confidence bound for the policy evaluator, which may be useful for optimistic planning and safe policy improvement.
Optimistic Policy Optimization with Bandit Feedback
Efroni, Yonathan, Shani, Lior, Rosenberg, Aviv, Mannor, Shie
Policy optimization methods are one of the most widely used classes of Reinforcement Learning (RL) algorithms. Yet, so far, such methods have been mostly analyzed from an optimization perspective, without addressing the problem of exploration, or by making strong assumptions on the interaction with the environment. In this paper we consider model-based RL in the tabular finite-horizon MDP setting with unknown transitions and bandit feedback. For this setting, we propose an optimistic trust region policy optimization (TRPO) algorithm for which we establish $\tilde O(\sqrt{S^2 A H^4 K})$ regret for stochastic rewards. Furthermore, we prove $\tilde O( \sqrt{ S^2 A H^4 } K^{2/3} ) $ regret for adversarial rewards. Interestingly, this result matches previous bounds derived for the bandit feedback case, yet with known transitions. To the best of our knowledge, the two results are the first sub-linear regret bounds obtained for policy optimization algorithms with unknown transitions and bandit feedback.
Worst-Case Regret Bounds for Exploration via Randomized Value Functions
Exploration is one of the central challenges in reinforcement learning (RL). A large theoretical literature treats exploration in simple finite state and action MDPs, showing that it is possible to efficiently learn a near optimal policy through interaction alone [5, 8, 10, 11, 13-16, 24, 25]. Overwhelmingly, this literature focuses on optimistic algorithms, with most algorithms explicitly maintaining uncertainty sets that are likely to contain the true MDP. It has been difficult to adapt these exploration algorithms to the more complex problems investigated in the applied RL literature. Most applied papers seem to generate exploration through ǫ-greedy or Boltzmann exploration. Those simple methods are compatible with practical value function learning algorithms, which use parametric approximations to value functions to generalize across high dimensional state spaces. Unfortunately, such exploration algorithms can fail catastrophically in simple finite state MDPs [See e.g.
Non-Asymptotic Gap-Dependent Regret Bounds for Tabular MDPs
Simchowitz, Max, Jamieson, Kevin
Reinforcement learning (RL) is a powerful paradigm for modeling a learning agent's interactions with an unknown environment, in an attempt to accumulate as much reward as possible. Because of its flexibility, RL can encode such a vast array of different problem settings - many of which are entirely intractable. Therefore, it is crucial to understand what conditions make it possible for an RL agent to effectively learn about its environment. In this paper, we consider tabular Markov decision processes (MDPs), a canonical RL setting where the agent seeks to learn a policy mapping discrete states x S to one of finitely many actions a A, in attempt to maximize cumulative reward over an episode horizon H. We shall study the regret setting, where the learner plays a policy π for a sequence of episodes k 1, . . .
Efficient Model-free Reinforcement Learning in Metric Spaces
Model-free Reinforcement Learning (RL) algorithms such as Q-learning [Watkins, Dayan 92] have been widely used in practice and can achieve human level performance in applications such as video games [Mnih et al. 15]. Recently, equipped with the idea of optimism in the face of uncertainty, Q-learning algorithms [Jin, Allen-Zhu, Bubeck, Jordan 18] can be proven to be sample efficient for discrete tabular Markov Decision Processes (MDPs) which have finite number of states and actions. In this work, we present an efficient model-free Q-learning based algorithm in MDPs with a natural metric on the state-action space--hence extending efficient model-free Q-learning algorithms to continuous state-action space. Compared to previous model-based RL algorithms for metric spaces [Kakade, Kearns, Langford 03], our algorithm does not require access to a black-box planning oracle.
Stochastic Lipschitz Q-Learning
In an episodic Markov Decision Process (MDP) problem, an online algorithm chooses from a set of actions in a sequence of $H$ trials, where $H$ is the episode length, in order to maximize the total payoff of the chosen actions. Q-learning, as the most popular model-free reinforcement learning (RL) algorithm, directly parameterizes and updates value functions without explicitly modeling the environment. Recently, [Jin et al. 2018] studies the sample complexity of Q-learning with finite states and actions. Their algorithm achieves nearly optimal regret, which shows that Q-learning can be made sample efficient. However, MDPs with large discrete states and actions [Silver et al. 2016] or continuous spaces [Mnih et al. 2013] cannot learn efficiently in this way. Hence, it is critical to develop new algorithms to solve this dilemma with provable guarantee on the sample complexity. With this motivation, we propose a novel algorithm that works for MDPs with a more general setting, which has infinitely many states and actions and assumes that the payoff function and transition kernel are Lipschitz continuous. We also provide corresponding theory justification for our algorithm. It achieves the regret $\tilde{\mathcal{O}}(K^{\frac{d+1}{d+2}}\sqrt{H^3}),$ where $K$ denotes the number of episodes and $d$ denotes the dimension of the joint space. To the best of our knowledge, this is the first analysis in the model-free setting whose established regret matches the lower bound up to a logarithmic factor.